Reduce memory usage by GroupReadsByUmi in a corner case #774
… there are many reads with the same start/stop and edits=0.
Codecov Report
@@ Coverage Diff @@
## master #774 +/- ##
==========================================
- Coverage 95.57% 95.50% -0.08%
==========================================
Files 119 119
Lines 6805 6830 +25
Branches 476 450 -26
==========================================
+ Hits 6504 6523 +19
- Misses 301 307 +6
logger.warning(s"Read (${rec.name}) detected with unexpected length UMI(s): ${sequences.mkString(" ")}.")
logger.warning(s"Expected UMI length: ${umiLength}")
Apologies for the unrelated change here. I had a stupid typo that took me far too long to figure out: I used -u instead of -U, and CorrectUmis happily decided my filename was the sole UMI sequence to correct to. If the message here had told me my expected UMI length was 30+, that would have helped!
I think there's a subtle change to the behavior which we could think of as an improvement, but it does change the behavior, so we should discuss.
iterator.hasNext &&
firstEnds == ReadInfo(iterator.head.r1.get) &&
// This last condition only works because we put a canonicalized UMI into rec(assignTag) if canTakeNextGroupByUmi
(!canTakeNextGroupByUmi || firstUmi == iterator.head.r1.get.apply[String](this.assignTag))
So this does change behavior in a subtle way. Suppose we have three templates that are identical in everything but the assign tag, say AAAA, GGGGG, and GGGGT, with --min-umi-length=3.
Previously, all templates would be read into memory, and truncateUmis would truncate to the length of the smallest UMI observed in the group of templates, in this case 4bp due to the AAAA (not 3 as per the command line!). So the three templates would have UMIs AAAA->AAAA, GGGGG->GGGG, and GGGGT->GGGG, and we'd assign two unique molecules (AAAA and GGGG, with the latter containing the last two templates).
In the new implementation, we truncate the raw UMI bases based on --min-umi-length=3 to set MI for sorting: AAAA->AAA, GGGGG->GGG, and GGGGT->GGG. When we read back in after sorting, we first read in the template with MI:AAA by itself, so no truncation is applied and its UMI stays AAAA. We then read in all templates whose MI is GGG, which gives the second two templates. Now we go back to the raw UMIs to find the length of the shortest UMI of the two. Both are 5bp long, so we do not truncate and keep them as they are (GGGGG and GGGGT). But now these UMIs differ, so we get three molecules!
One could argue that the new implementation is an improvement, but it does change behavior subtly.
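To make the example concrete, here is a minimal sketch of the two behaviors described above. It is not the fgbio implementation: the object and method names are hypothetical, it assumes single (unpaired) UMIs and edits=0 so that each distinct truncated UMI counts as one molecule, and it assumes every group of UMIs is non-empty.

```scala
// Hypothetical sketch only; these names do not exist in fgbio.
object UmiTruncationSketch {
  /** Truncates every UMI to the length of the shortest UMI in the (non-empty) group. */
  def truncateToShortest(umis: Seq[String]): Seq[String] = {
    val minLength = umis.map(_.length).min
    umis.map(_.take(minLength))
  }

  /** Previous behavior: all templates at the same position are truncated together. */
  def moleculesBefore(umis: Seq[String]): Set[String] =
    truncateToShortest(umis).toSet

  /** New behavior: templates are first sub-grouped by the first `minUmiLength` bases
    * (the value used for sorting), and truncation is then applied within each sub-group
    * using the raw UMIs. */
  def moleculesAfter(umis: Seq[String], minUmiLength: Int): Set[String] =
    umis.groupBy(_.take(minUmiLength)).values.flatMap(truncateToShortest).toSet
}
```

For the three UMIs above with minUmiLength = 3, moleculesBefore gives AAAA and GGGG (two molecules), while moleculesAfter gives AAAA, GGGGG, and GGGGT (three molecules).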
Hrm, that is an interesting point. So basically, if you have variable-length single UMIs and edits = 0, the behavior will be subtly different. My instinct is to call this an improvement and move on, but I don't have a great sense of who uses the variable-length support (or on what kind of data), so I'm not 100% sure.
Co-authored-by: Nils Homer <[email protected]>